ABSTRACT

Reasons  are  advanced  for  the  belief  that  scientific  method  employs  and 
requires  not  one,  but  two  kinds  of  inference  -  criticism  and  estimation;  once 
this  is  understood  the  statistical  advances  made  in  recent  years  in  Bayesian 
methods,  data  analysis,  robust  and  shrinkage  estimators  can  be  seen  as  a 
cohesive  whole. 
SIGNIFICANCE AND EXPLANATION


For many years there has existed a major controversy among statisticians
concerning  whether  Bayesian  theory  or  Sampling  (frequentist)  theory  was 
appropriate  for  making  statistical  inferences.  The  roles  of  data  analysis  and 
of  robust  and  shrinkage  estimators  have  also  been  matters  of  dispute. 

Building  on  results  from  an  earlier  paper  it  is  here  argued  that  a  study  of 
scientific  method  and  of  the  part  played  in  it  by  the  human  brain  shows  that 
two different kinds of statistical inference, estimation and criticism, are
needed, for which Bayes and Sampling theory respectively are uniquely
appropriate.  This  point  of  view  also  shows  how  data  analysis,  robust  and 
shrinkage  estimators  all  have  appropriate  parts  to  play  in  the  iterative 
scheme  of  scientific  enquiry. 
AN APOLOGY FOR ECUMENISM IN STATISTICS


G.  E.  P.  BOX 


Perhaps  I  should  begin  with  an  apology  for  my  title. 
These  days  the  statistician  is  often  asked  such  questions  as 
"Are you a Bayesian?" "Are you a frequentist?" "Are you a
data  analyst?"  "Are  you  a  designer  of  experiments?"  I  will 
argue  that  the  appropriate  answer  to  all  these  questions  can 
be  (and  preferably  should  be)  "yes",  and  that  we  can  see  why 
this  is  so  if  we  consider  the  scientific  context  of  what 
statisticians  do. 

For  many  years  Statistics  has  seemed  to  be  in  a  rather 
turbulent  state  and  the  air  has  been  full  of  argument  and 
controversy.  The  relative  virtue  of  alternative  methods  of 
inference  and,  in  particular,  of  Bayes'  and  Sampling 
(frequentist)  inference  has  been  hotly  debated.  Recently 
Data  Analysis  has  rightly  received  much  heavier  emphasis, 
but  its  more  avid  proponents  have  sometimes  seemed  to 
suggest that all else is worthless. Furthermore, while
biased estimators, in particular the shrinkage and ridge
estimators that have been advocated to replace the more
standard varieties, are clearly sensible in appropriate
contexts, their frequentist justification, which ignores
context, seems unconvincing. Parallel criticism may be made
of ad hoc robust procedures, the proliferation of which has
worried some dissidents, who have argued, for example, that
mechanical downweighting of peculiar observations may divert
attention from important clues to new discovery.
Insofar as these debates lead us to progressive change
in  our  ideas  they  are  healthy  and  productive,  but  insofar  as 
they  encourage  polarization  they  may  not  be.  One  remembers 
with  some  misgivings  Saxe's  poem  about  the  six  blind  men  of 
Hindustan  investigating  an  elephant.  It  will  be  recollected 
that  one,  feeling  only  the  elephant’s  trunk,  thought  it  like 
a  snake,  another,  touching  its  ear,  thought  it  must  be  a 
fan,  etc.  The  poem  ends: 

And so these men of Hindustan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong.

Some  of  the  difficulties  arise  from  the  need  to 
simplify.  But  simplification  included  merely  to  produce 
satisfying  mathematics  or  to  reduce  problems  to  convenient 
small  sized  pieces  can  produce  misleading  conclusions. 
Simplification  which  retains  the  essential  scientific 
essence  of  the  problem  is  most  likely  to  lead  to  useful 
answers  but  this  requires  understanding  of,  and  interest  in, 
scientific  context. 

1.  SOME  QUESTIONABLE  SIMPLIFICATIONS. 

(a)  It  has  been  argued  that  Bayes'  theorem  uniquely 
solves  all  problems  of  inference.  However  only  part  of  the 
inferential  exercises  in  which  the  statistical  scientist  is 
ordinarily  engaged  seem  to  conveniently  fit  the  Bayesian 
mold.  In  particular  diagnostic  checks  of  goodness  of  fit 
involving  various  analyses  of  residuals  seem  to  require 
other  justification.  In  fact  I  believe  (Box  [1980])  that 
the  process  of  scientific  investigation  involves  not  one  but 
two  kinds  of  inference:  estimation  and  criticism,  used 
iteratively  and  in  alternation.  Bayes  completely  solves  the 
problem  of  estimation  and  can  also  be  helpful  at  the 
criticism  stage  in  judging  the  relative  plausibility  of  two 
or  more  models.  However  because  of  its  necessarily 
conditional  nature,  it  cannot  deal  with  the  most  essential 
part  of  inferential  criticism  which  requires  a  sampling 
(frequentist)  justification. 
(b) Fisher [1956] believed that the Neyman-Pearson
theory  for  testing  statistical  hypotheses,  while  providing  a 
model  for  industrial  quality  control  and  sampling 
inspection,  did  not  of  itself  provide  an  appropriate  basis 
for  the  conduct  of  scientific  research.  This  can  be 
regarded  as  the  complement  to  the  objection  raised  in  (a), 
for  statistical  quality  control  and  inspection  are  methods 
of  inferential  criticism  supplying  a  continuous  check  on  the 
adequacy  of  fit  of  the  model  for  the  properly  operating 
process.  I  would  regard  Fisher's  comment  as  meaning  that 
the  Neyman-Pearson  theory  was  irrelevant  to  problems  of 
estimation.  Certainly  there  is  evidence  in  the  social 
sciences  that  excessive  reliance  upon  this  theory  alone, 
encouraged  by  the  mistaken  prejudices  of  referees  and 
editors,  has  led  to  harmful  distortion  of  the  conduct  of 
scientific  investigation  in  these  fields. 

(c)  In  some  important  contexts  the  scientific 
relevance  of  alphabetic  optimality  criteria  (A,E,D,G  etc.) 
in  the  choice  of  experimental  designs  has  been  questioned 
(see  discussion  of  Kiefer  [1959],  also  Box  [1982]).  Here 
again  there  is  danger  of  deleterious  feedback  since  users  of 
statistical  design,  perhaps  dazzled  by  impressive  but  poorly 
comprehended  mathematics,  may  fail  to  realize  the  naive 
framework  within  which  the  optimality  occurs. 

(d) Even Data Analysis, excellent in itself, presents
some dangers. It is a major step forward that in these days
students of statistics are required more and more to work
on real data. Indeed suitable "data sets" have been set
aside for their study. But this too can produce
misunderstanding. For instance, some examples have become notorious
and have been analyzed by a plethora of experts; one
finds three outliers, another claims that a transformation
is needed and then only one outlier occurs, and so on.
Too much exposure to this sort of thing can again lead to
the mistaken idea that this represents the real context
of scientific investigation. The statistician in his
proper role as a member of a scientific team should
certainly make such analyses, but realistically he would
then discuss them with his scientific colleagues and
present, when appropriate, not one, but alternative
plausible  possibilities.  He  need  not,  and  usually  should 
not,  choose  among  them.  Rather  he  should  make  sure  that 
these  possibilities  were  considered  when  he  and  his 
scientific  colleagues  planned  the  next  stage  of  the 
investigation.  Together  they  would  choose  the  next  design 
so  that  among  other  things  it  could  resolve  current 
uncertainties  judged  to  be  important.  In  particular  the 
possible  meaning  and  importance  of  discrepant  values  would 
then  be  discussed  as  well  as  the  meaning  of  analyses  which 
downweighted  or  excluded  them. 

The most dangerous and misleading of the unstated
assumptions suggested to some extent by all these
simplifications concerns the implied static nature of the
process of investigation. A Bayesian analysis is made; a
hypothesis is tested; one model is considered; a single
design is run; a single set of data is examined and
reexamined.(1)

I  believe  that  the  object  of  statistical  theory  should 
be  to  explain,  at  least  approximately,  what  good  scientists 
do  and  to  help  them  do  it  better.  It  seems  necessary 
therefore  to  examine  at  least  briefly  the  nature  of  the 
scientific  process  itself. 

2.  SCIENTIFIC  METHOD  AND  THE  HUMAN  BRAIN. 

Scientific  method  is  a  formalization  of  the  everyday 
process  of  finding  things  out.  For  thousands  of  years, 
things  were  found  out  largely  as  a  result  of  chance 
occurrences. For a new "natural law" to be discovered, two
circumstances needed to coincide: (a) a potentially
informative experience needed to occur, and (b) the
phenomenon needed to be known about by someone of sufficient
acuity of mind to formulate, and preferably to test, a
possible rule for its future occurrence.

(1) While provision is made for adaptive feedback in data
analysis, usually the possibility of acquiring further data
to illuminate points at issue is not. What we do as
statisticians depends heavily on expectations implied by our
training. While a previous generation of graduates might
have expected to prove theorems, occasionally to test an
isolated hypothesis, and perhaps to teach a new generation
of students to do likewise, the present generation might be
forgiven for believing that their fate is only to explore
"data sets" and speculate on what might or might not explain
them. We must encourage our students to accept the heritage
bestowed by Fisher, who elevated the statistician from an
archivist to an active designer of experiments and hence an
architect and coequal investigator.

Progress  was  slow  because  of  the  rarity  of  the  two 
necessary  individual  circumstances  and  the  still  greater 
rarity of their coincidence. Experimental science
accelerates this learning process by isolating its essence:
potentially  informative  experiences  are  deliberately 
staged  and  made  to  occur  in  the  presence  of  a  trained 
investigator.  As  science  has  developed,  we  have  learned  how 
such  artificial  experiences  may  be  carefully  contrived  to 
isolate  questions  of  interest,  how  conjectures  that  are  put 
forward  may  be  tested,  and  how  residual  differences  from 
what  had  been  expected  can  be  used  to  modify  and  improve 
initial  ideas.  So  the  ordinary  process  of  learning  has  been 
sharpened  and  accelerated. 

The  instrument  of  all  learning  is  the  brain  -  an 
incredibly  complex  structure,  the  working  of  which  we  have 
only  recently  begun  to  understand.  One  thing  that  is  clear 
is  the  importance  to  the  brain  of  models.  To  appreciate  why 
this  is  so,  consider  how  helpless  we  would  be  if,  each 
night,  all  our  memories  were  eliminated,  so  that  we  awoke  to 
each  new  day  with  no  past  experiences  whatever  and  hence  no 
models  to  guide  our  conduct.  In  fact,  our  past  experience 
is conveniently accumulated in models $M_1, M_2, \ldots, M_i, \ldots$.
Some of these models are well established, others less so,
while  still  others  are  in  the  very  early  stages  of  creation. 
When some new fact or body of facts $y_d$ comes to our
attention,  the  mind  tries  to  associate  this  new  experience 
with  an  established  model.  When,  as  is  usual,  it  succeeds 
in  doing  so,  this  new  knowledge  is  incorporated  in  the 
appropriate  model  and  can  set  in  train  appropriate  action. 

Obviously,  to  avoid  chaos  the  brain  must  be  good  at 
allocating  data  to  an  appropriate  model  and  at  initiating 
the  construction  of  a  new  model  if  this  should  prove  to  be 
necessary. To conduct such business the mind must be able
to deduce what facts could be expected as realizations of a
particular model and, more difficult, to induce what
model(s) are consonant with particular facts.

Thus,  it  is  concerned  with  two  kinds  of  inference: 

(A) Contrasting of new facts $y_d$ with a possible model
$M$: an operation I will characterize by subtraction
$M - y_d$. This process stimulates induction and will be
called criticism.

(B) Incorporating new facts $y_d$ into an
appropriate model: an operation I will characterize by
addition $M + y_d$. This process is entirely deductive and
will be called estimation.

I  believe  then  that  many  of  our  difficulties  arise 
because,  while  there  is  an  essential  need  for  two  kinds  of 
inference,  there  seems  an  inherent  propensity  among 
statisticians  to  seek  for  only  one. 

In  any  case,  research  which,  following  the  discoveries 
of  Roger  Sperry  and  his  associates,  has  gathered  great 
momentum  in  the  past  25  years  shows  that  the  human  brain 
behaves  not  as  a  single  entity  but  as  two  largely  separate 
but  cooperating  instruments  which  do  two  different  things 
(see for example Springer and Deutsch [1981], Blakeslee
[1980]). 

In most people(2) the left brain is concerned
primarily with language and logical deduction, the right
brain with images, patterns and inductive processes. The
two sides of the brain are joined by millions of connections
in the corpus callosum. It is known that the left brain
plays a conscious and dominant role while by contrast one
may be quite unaware(3) of the working of the right brain.

(2) In about one third of left-handed people (about 5% of the
population) the roles of the right and left brain are
reversed.

(3) For example the apparently instinctive knowledge of what
to do and how to do it, enjoyed by the experienced tennis
player and by the experienced motorist, comes from the right
brain.


The right brain's ability to appreciate(4) patterns in
data $y_d$ and to find patterns in discrepancies $M - y_d$
between the data and what might be expected if some
tentative model were true is of great importance in the
search for explanations of data and of discrepant events.
This accomplishment of the right brain of pattern
recognition is of course of enormous consequence in
scientific discovery.(5) However, some check is needed on
its pattern seeking ability, for common experience shows
that some pattern or other can be seen in almost any set of
data or facts.(6) This is the object of diagnostic checks
and tests of fit which, I will argue, require frequentist
theory significance tests for their formal justification.

3. THE THEORY-PRACTICE ITERATION.

It  has  long  been  recognized  that  the  learning  process 
is  a  motivated  iteration  between  theory  and  practice.  By 
practice  I  mean  reality  in  the  form  of  data  or  facts.  In 
this  iteration  deduction  and  induction  are  employed  in 
alternation.  Progress  of  an  investigation  is  thus  evidenced 
by  a  theoretical  model,  which  is  not  static,  but  by 
appropriate exposure to reality continually evolves until
some currently satisfactory level of understanding is
reached. At any given stage in a scientific investigation
the current model helps us to appreciate not only what we
know, but what else it may yet be important to find out and
so motivates the collection of new data to illuminate dark
but possibly interesting corners of present knowledge. See
for example Box and Youle [1955], Box [1976], Box, Hunter
and Hunter [1978].

(4) Implicit recognition of the need to stimulate the
remarkable pattern-seeking ability of the right brain is
evidenced by modern emphasis on ingenious plotting devices
in the model formulation/modification phases of
investigation. In particular Chernoff's representation of
multivariate data by faces [1973] and earlier Edgar
Anderson's use of glyphs [1960] direct the right brain to
the recognition problem at which it excels.

(5) Manifestations of the importance to discovery of
unconscious pattern seeking by the right brain have often
been noticed. For example, Beveridge [1950] remarks that
happenings of the following kind are commonplace: a
scientist has mulled over a set of data for many months and
then, at a certain point in time, perhaps on a country walk
when the problem is not being consciously thought about, he
suddenly becomes aware of a solution (model) which explains
these data. This point in time is presumably that at which
the right brain sees fit to let the left brain know what it
has figured out.

(6) See, for example, the King of Hearts' rationalization of
the poem brought as evidence in the trial of the Knave of
Hearts in Lewis Carroll's Alice in Wonderland.

The  reader  can  find  illustration  of  these  matters  in 
his  everyday  experience,  or  in  the  evolution  of  the  plot  of 
any  good  mystery  novel,  as  well  as  in  any  reasonably  honest 
account of the events leading to scientific discovery.

Different levels of adaptation

The  adaptive  iteration  we  have  described  produces 
change  in  what  we  believe  about  the  system  being  studied, 
but  it  can  also  produce  change  in  how  we  study  it,  and 
sometimes even in the objective(7) of the study. This
multiple  adaptivity  explains  the  surprising  property  of 
convergence  of  a  process  of  investigation  which  at  first 
appears  hopelessly  arbitrary.  See  for  example  Box  [1957]. 

To  appreciate  this  arbitrariness,  suppose  that  some 
scientific  problem  were  being  studied  by,  say,  10 
independent  sets  of  investigators,  all  competent  in  the 
field  of  endeavor.  It  is  certain  that  they  would  start  from 
different  points,  conduct  the  investigation  in  different 
ways,  have  different  initial  ideas  about  which  variables 
were important, on what scales, and in which
transformation. Yet it is perfectly possible that they would all
eventually reach similar conclusions. It is important
to bear this context of multiple iteration in mind
when we consider the scientific process and how it relates
to a statistical method.

(7) If we start out to prospect for silver, we should not
ignore an accidental discovery of gold. For example, one
experimental attempt to find manufacturing conditions giving
greater yield of a particular product failed to find any
such, but did find reaction conditions giving the same yield
with the reaction time halved. This meant that, by
switching to the new manufacturing conditions, throughput
could be doubled, and that a costly, previously planned,
extension of the plant was unnecessary.

4.  STATISTICAL  ESTIMATION  AND  CRITICISM. 

In  a  recent  paper  (Box  [1980])  a  statistical  theory  was 
presented  which,  it  was  argued,  was  consonant  with  the  view 
of scientific investigation outlined above. Suppose at the
$i$th stage of such an investigation a set of assumptions
$A_i$ are tentatively entertained which postulate that, to an
adequate approximation, the density function for potential
data $y$ is $p(y \mid \theta, A_i)$ and the prior distribution for $\theta$
is $p(\theta \mid A_i)$. Then it was argued that the model should
be defined as the joint distribution of $y$ and $\theta$

$$p(y, \theta \mid A_i) = p(y \mid \theta, A_i)\, p(\theta \mid A_i) \tag{1}$$

since it is a complete statement of prior tentative belief
at stage $i$. In these expressions $A_i$ is understood to
indicate all or some of the assumptions in the model
specification at stage $i$. The model of equation (1) means
to me that current belief about the outcome of contemplated
data acquisition would be calibrated with adequate
approximation by a physical simulation involving appropriate
random sampling from the distributions $p(y \mid \theta, A_i)$ and
$p(\theta \mid A_i)$.
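
The "physical simulation" reading of equation (1) can be sketched in a few lines. This is only an illustration under stated assumptions: a beta prior and a binomial sampling density are chosen here for concreteness (they anticipate the example of Section 4.1), and the function name `simulate_model` and the settings used are hypothetical, not taken from the paper.

```python
import random


def simulate_model(n, m, theta0, draws, seed=0):
    """Draw (theta, y) pairs from the joint model p(y, theta | A).

    Assumed for illustration: theta ~ Beta(m*theta0, m*(1 - theta0)) plays the
    role of the prior p(theta | A), and y | theta ~ Binomial(n, theta) plays
    the role of the sampling density p(y | theta, A).
    """
    rng = random.Random(seed)
    a, b = m * theta0, m * (1.0 - theta0)
    samples = []
    for _ in range(draws):
        theta = rng.betavariate(a, b)                     # sample from the prior
        y = sum(rng.random() < theta for _ in range(n))   # then from p(y | theta)
        samples.append((theta, y))
    return samples


samples = simulate_model(n=10, m=20, theta0=0.2, draws=20000)
mean_y = sum(y for _, y in samples) / len(samples)
print(f"simulated predictive mean of y: {mean_y:.2f}")
```

The simulated values of y, taken on their own, approximate the predictive distribution of potential data implied by the assumptions, which is the reference distribution used for criticism.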

The model can also be factored as

$$p(y, \theta \mid A) = p(\theta \mid y, A)\, p(y \mid A). \tag{2}$$

The second factor on the right, which can be computed before
any data become available,

$$p(y \mid A) = \int p(y \mid \theta, A)\, p(\theta \mid A)\, d\theta, \tag{3}$$

is the predictive distribution of the totality of all
possible samples $y$ that could occur if the assumptions
were true.

When an actual data vector $y_d$ becomes available,

$$p(y_d, \theta \mid A) = p(\theta \mid y_d, A)\, p(y_d \mid A). \tag{4}$$

The first factor on the right is the Bayes' posterior
distribution of $\theta$ given $y_d$,

$$p(\theta \mid y_d, A) \propto p(y_d \mid \theta, A)\, p(\theta \mid A), \tag{5}$$

while the second factor,

$$p(y_d \mid A) = \int p(y_d \mid \theta, A)\, p(\theta \mid A)\, d\theta, \tag{6}$$

is the predictive density associated with the particular
data $y_d$ actually obtained conditional on the truth of the
model and on the data having occurred.

The posterior distribution $p(\theta \mid y_d, A)$ allows all
relevant estimation inferences to be made about $\theta$, but
this posterior distribution can supply no information about
the adequacy of the model. Information on adequacy may be
provided, however, by reference of the density $p(y_d \mid A)$ to
the predictive reference distribution $p(y \mid A)$, or of the
density $p\{g_i(y_d) \mid A\}$ of some relevant checking function
$g_i(y_d)$ to its predictive distribution, and in particular by
computing the probabilities

$$\Pr\{p(y \mid A) < p(y_d \mid A)\} \tag{7}$$

$$\Pr[\,p\{g_i(y) \mid A\} < p\{g_i(y_d) \mid A\}\,] \tag{8}$$

Two  illustrative  examples  follow. 

4.1.  The  Binomial  Model 

As an elementary example, suppose inferences are to be
made about the proportion $\theta$ of successes in a set of
binomial trials.

Suppose $n$ trials are about to be made and assume a
beta-distribution prior with mean $\theta_0$. Then

$$p(\theta \mid A) = [B\{m\theta_0, m(1-\theta_0)\}]^{-1}\, \theta^{m\theta_0 - 1} (1-\theta)^{m(1-\theta_0) - 1}, \tag{9}$$

$$p(y \mid \theta, A) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y}, \tag{10}$$

and the predictive distribution is

$$p(y \mid A) = \binom{n}{y} [B\{m\theta_0, m(1-\theta_0)\}]^{-1} B\{m\theta_0 + y,\; m(1-\theta_0) + n - y\}, \tag{11}$$

which may be computed before the data are obtained.

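If, now, having performed $n$ trials, there are $y_d$
successes, the likelihood defined up to a multiplicative
constant is

$$L(\theta \mid y_d) = \theta^{y_d} (1-\theta)^{n-y_d}, \tag{12}$$

the predictive density is

$$p(y_d \mid A) = \binom{n}{y_d} [B\{m\theta_0, m(1-\theta_0)\}]^{-1} B\{m\theta_0 + y_d,\; m(1-\theta_0) + n - y_d\}, \tag{13}$$

and the posterior distribution of $\theta$ is

$$p(\theta \mid y_d, A) = [B\{m\theta_0 + y_d,\; m(1-\theta_0) + n - y_d\}]^{-1}\, \theta^{y_d + m\theta_0 - 1} (1-\theta)^{n - y_d + m(1-\theta_0) - 1}. \tag{14}$$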


In the examples of Figures 1 and 2 full lines are used for
items available prior to the availability of data $y_d$ and
dotted lines for items available only after the data $y_d$
are in hand. Both Figures 1 and 2 illustrate a situation
where the prior distribution $p(\theta \mid A)$ has mean $\theta_0 = 0.2$
and $m = 20$ and we know that $n = 10$ trials are to be
performed. Knowing these facts, we can immediately
calculate the predictive distribution $p(y \mid A)$ which is the
probability distribution for all possible outcomes from such
a model if we suppose the model is true.

When the experiment is actually performed suppose at
first, as in Figure 1, that $y_d = 3$ of the trials are
successes. The predictive probability $p(3 \mid A)$ associated
with this outcome is not unusually small. In fact
$\Pr\{p(y \mid A) < p(3 \mid A)\} = 0.42$ and we have no reason to
question the model. Thus for this sample the likelihood
$L(\theta \mid y)$ may reasonably be combined with the prior to
produce the posterior distribution shown.

In Figure 2 however it is supposed instead that the
outcome is $y_d = 8$ successes so that for this sample
$\Pr\{p(y \mid A) < p(8 \mid A)\} = 0.0013$ and the adequacy of the
model, and in particular the adequacy of the prior
distribution, is now called into question. Inspection of
the figure shows how this agrees with common sense; for in
the case illustrated the posterior distribution is unlike
either the prior distribution or the likelihood which were
combined to obtain it.
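
The binomial calculation can be reproduced numerically. The following is a minimal sketch rather than the author's own computation: it assumes the beta-binomial form of the predictive distribution with $\theta_0 = 0.2$, $m = 20$ and $n = 10$, and it reads the predictive check with a non-strict inequality, $\Pr\{p(y \mid A) \le p(y_d \mid A)\}$, which is an assumption about how the observed outcome itself is counted. The function names are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def predictive_pmf(n, m, theta0):
    """Beta-binomial predictive probabilities p(y | A) for y = 0, ..., n."""
    a, b = m * theta0, m * (1.0 - theta0)
    return [
        comb(n, y) * exp(log_beta(a + y, b + n - y) - log_beta(a, b))
        for y in range(n + 1)
    ]


def predictive_check(y_d, n, m, theta0):
    """Pr{p(y | A) <= p(y_d | A)}: total predictive probability of outcomes
    no more probable than the one observed (non-strict inequality assumed)."""
    pmf = predictive_pmf(n, m, theta0)
    return sum(p for p in pmf if p <= pmf[y_d])


def posterior_parameters(y_d, n, m, theta0):
    """Parameters of the beta posterior distribution of theta given y_d."""
    return m * theta0 + y_d, m * (1.0 - theta0) + n - y_d


alpha_8 = predictive_check(8, n=10, m=20, theta0=0.2)
print(f"predictive check for y_d = 8: {alpha_8:.4f}")  # close to 0.0013
print("posterior beta parameters:", posterior_parameters(8, 10, 20, 0.2))
```

Under these assumptions the check for $y_d = 8$ comes out close to the 0.0013 quoted above; other outcomes can be examined by changing the first argument. Note that the posterior parameters are always available whatever the data, which is why a separate check on the predictive distribution is needed.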
Figure 1. Prior, likelihood, predictive and posterior
distributions for $n = 10$ Bernoulli trials
with $y_d = 3$ successes. Annotation: $\Pr\{p(y \mid A) \le p(y_d \mid A)\}$.

Figure 2. Prior, likelihood, predictive and posterior
distributions for $n = 10$ Bernoulli trials
with $y_d = 8$ successes. Annotation: $\Pr\{p(y \mid A) \le p(y_d \mid A)\} = 0.0013$.



Misgivings about the use of Bayes' theorem which some
have expressed in the past are certainly associated with the
possibility of distorting the information coming from the
data by the use of an inappropriate prior distribution.
Without predictive checks, the following objections would
carry great weight:

(a) that nothing in the Bayes' calculation of the
posterior distribution itself could warn of the
incompatibility of the data and the model, and especially
the prior; and

(b)  that  in  complicated  examples  it  would  not  be  so 
obvious  when  this  incompatibility  occurred. 

A case of particular interest occurs when the prior is
sharply centered(8) at its mean value $\theta_0 = 0.2$. This
happens in the above binomial setup when $m$ is made very
large. Then, if the model is unquestioned, the posterior
distribution will be essentially the same as the prior,
leading to the conclusion that $\theta$ is close to $\theta_0$ whatever
the data. The predictive distribution in this case is
$p(y \mid \theta_0, A)$, the ordinary binomial sampling distribution,
and the predictive check is the standard binomial
significance test, which can discredit the model with
$\theta = \theta_0 = 0.2$ and hence discredit the application of Bayes'
theorem to this case. This, to my mind, produces the most
satisfactory justification for the standard significance
test.
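
The limiting case can be illustrated directly. In this sketch, which assumes a prior concentrated entirely at $\theta_0 = 0.2$ so that the predictive distribution is the binomial sampling distribution, the predictive check reduces to a binomial significance calculation. The function names are hypothetical, and the non-strict inequality is again an assumption.

```python
from math import comb


def binomial_pmf(y, n, theta):
    """Binomial sampling probability p(y | theta) for n trials."""
    return comb(n, y) * theta**y * (1.0 - theta) ** (n - y)


def binomial_check(y_d, n, theta0):
    """Pr{p(y | theta0) <= p(y_d | theta0)} under the point-prior model:
    the probability of outcomes no more probable than the one observed."""
    pmf = [binomial_pmf(y, n, theta0) for y in range(n + 1)]
    return sum(p for p in pmf if p <= pmf[y_d])


alpha = binomial_check(8, n=10, theta0=0.2)
print(f"significance level for y_d = 8: {alpha:.2e}")  # about 7.8e-05
```

Here the posterior would stay at $\theta_0$ whatever the data, so the small significance level is the only signal that the model with $\theta = \theta_0$ is not credible for an outcome of 8 successes in 10 trials.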

(8) Such a model with a prior sharply centered at $\theta = 0.2$ might
be appropriate, for instance, if a trial consisted of spinning ten
times what seemed to be a properly balanced pentagonal top and
counting the number of times the top fell on a particular segment.

4.2. The Normal Linear Model and Ridge Estimators

Another example, discussed in Box [1980], concerns the
normal linear model. In a familiar notation suppose

$$y \sim N(\mathbf{1}\mu + X\theta,\; I_n\sigma^2) \tag{15}$$

with $\mathbf{1}$ a vector of unities and $X$ of full rank $k$ and
such that $X'\mathbf{1} = 0$, and suppose that prior densities are
locally approximated by

$$\mu \sim N(\mu_0, c^{-1}\sigma^2), \quad \theta \sim N(\theta_0, \Gamma^{-1}\sigma^2), \quad (\sigma^2/\nu_0 s_0^2) \sim \chi^{-2}(\nu_0), \tag{16}$$

where $\chi^{-2}(\nu_0)$ refers to the inverted $\chi^2$ distribution
with $\nu_0$ degrees of freedom, and $\mu$ and $\theta$ are independent
conditional on $\sigma^2$.

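Given a sample $y_d$, special interest attaches to $\theta$
and $\sigma^2$ which, given the assumptions, are estimated by
$p(\theta, \sigma^2 \mid y_d, A)$ with marginal distributions

$$p(\theta \mid y_d, A) \propto \left\{1 + \frac{(\theta - \bar\theta_d)'(X'X + \Gamma)(\theta - \bar\theta_d)}{(n + \nu_0)\bar\sigma_d^2}\right\}^{-\frac{1}{2}(n + \nu_0 + k)}, \qquad p(\sigma^2 \mid y_d, A) \propto \sigma^{-(n + \nu_0 + 2)} \exp\{-\tfrac{1}{2}(n + \nu_0)\bar\sigma_d^2/\sigma^2\}, \tag{17}$$

with

$$\bar\theta_d = (X'X + \Gamma)^{-1}(X'X\hat\theta_d + \Gamma\theta_0), \tag{18}$$

$$\hat\theta_d = (X'X)^{-1}X'y_d, \qquad \nu = n - k - 1,$$

$$(n + \nu_0)\bar\sigma_d^2 = \nu s_d^2 + \nu_0 s_0^2 + (\hat\theta_d - \theta_0)'\{(X'X)^{-1} + \Gamma^{-1}\}^{-1}(\hat\theta_d - \theta_0) + (n^{-1} + c^{-1})^{-1}(\bar y_d - \mu_0)^2, \tag{19}$$

$$\nu s^2 = (y - \mathbf{1}\bar y)'\{I - X(X'X)^{-1}X'\}(y - \mathbf{1}\bar y), \qquad \nu s_d^2 = (y_d - \mathbf{1}\bar y_d)'\{I - X(X'X)^{-1}X'\}(y_d - \mathbf{1}\bar y_d). \tag{20}$$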

Now let

$$s_p^2 = (\nu + \nu_0)^{-1}(\nu s^2 + \nu_0 s_0^2) \quad \text{and} \quad s_{pd}^2 = (\nu + \nu_0)^{-1}(\nu s_d^2 + \nu_0 s_0^2). \tag{21}$$

Then the joint predictive distribution can be factored into
independent components for $(\hat\theta - \theta_0)/s_p$, $s_p^2$, and $\nu - 1$
angular elements of the standardized residuals. A predictive
check based on the first of these factors,

$$\Pr\{p((\hat\theta - \theta_0)/s_p \mid A) < p((\hat\theta_d - \theta_0)/s_{pd} \mid A)\} = \Pr\left\{F_{k,\,\nu + \nu_0} > \frac{(\hat\theta_d - \theta_0)'\{(X'X)^{-1} + \Gamma^{-1}\}^{-1}(\hat\theta_d - \theta_0)}{k s_{pd}^2}\right\}, \tag{22}$$

is the standard analysis of variance check for compatibility
of two estimates $\hat\theta_d$ and $\theta_0$ and was earlier proposed as a
check for compatibility of prior and sample information by
Theil [1963].

Now suppose the $X$ matrix to be in correlation form
and assume $\theta_0 = 0$, $\Gamma = I_k\gamma_0$, $\nu_0 = 0$ so that $s_p^2 = s^2$. Then
the estimates $\bar\theta_d$ are the ridge estimators of Hoerl and
Kennard [1970] which, given the assumptions, appropriately
combine information from the prior with information from the
data. The predictive check (22) now yields

$$\alpha = \Pr\left\{F_{k,\nu} > \frac{\hat\theta_d'\{(X'X)^{-1} + I_k\gamma_0^{-1}\}^{-1}\hat\theta_d}{k s_d^2}\right\}, \tag{23}$$

allowing any choice of $\gamma_0$ to be criticized.

For example, in their original analysis of the data of
Gorman and Toman [1966], Hoerl and Kennard [1970] chose a
value $\gamma_0 = 0.25$. But substitution of this value in (23)
yields $\alpha = \Pr\{F_{10,25} > 3.59\} < 0.01$, which discredits this
choice.
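
The step from the Bayes' mean to the ridge estimator can be written out as a short worked derivation. This is a sketch that assumes the posterior mean has the usual conjugate form, a combination of the least squares estimate and the prior mean weighted by $X'X$ and $\Gamma$; the symbols $\bar\theta_d$ and $\hat\theta_d$ denote the posterior mean and the least squares estimate.

```latex
% Assumed form of the posterior mean (weights X'X and \Gamma):
\bar\theta_d = (X'X + \Gamma)^{-1}\,(X'X\,\hat\theta_d + \Gamma\,\theta_0),
\qquad \hat\theta_d = (X'X)^{-1}X'y_d .

% Setting \theta_0 = 0 and \Gamma = \gamma_0 I_k:
\bar\theta_d
  = (X'X + \gamma_0 I_k)^{-1}\,X'X\,(X'X)^{-1}X'y_d
  = (X'X + \gamma_0 I_k)^{-1}\,X'y_d ,

% which is the ridge estimator with ridge constant \gamma_0.
```

Seen this way, choosing a ridge constant $\gamma_0$ amounts to asserting a prior for $\theta$ centered at zero with precision proportional to $\gamma_0$, and it is this assertion that the predictive check puts to the test.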

One  can  see  for  these  examples  how  the  two  functions  of 
criticism  and  estimation  are  performed  by  the  predictive 
check  on  the  one  hand  and  the  Bayesian  posterior 
distribution  on  the  other. 

Thus  consider  the  ridge  (Bayes'  mean)  estimator  of  the 
second  example.  This  estimator  is  a  linear  combination  of 

A 

the  least  squares  estimate  9  and  the  prior  mean  0q,  with 
weights  supplied  by  the  appropriate  information  matrices, 
and  with  covariance  matrix  obtained  by  inverting  the  sum  of 
these  information  matrices.  Assuming  the  data  to  be  a 
realization  of  the  model,  this  is  the  appropriate  way  of 
combining  the  two  sources  of  information. 

The predictive check, on the other hand, contrasts the
values $\hat\theta$ and $\theta_0$ with a dispersion matrix obtained by
appropriately summing the two dispersion matrices.
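
The difference between combining and contrasting can be seen in a scalar sketch. It assumes a single parameter with known variances for both the least squares estimate and the prior, which is a simplification of the matrix case discussed above; the function name and the numbers are hypothetical.

```python
import math

def combine_and_contrast(theta_hat, var_hat, theta_0, var_0):
    """Scalar normal case with known variances (an illustrative simplification)."""
    info_hat, info_0 = 1.0 / var_hat, 1.0 / var_0
    # Estimation: weight each source by its information, then invert the summed information.
    post_mean = (info_hat * theta_hat + info_0 * theta_0) / (info_hat + info_0)
    post_var = 1.0 / (info_hat + info_0)
    # Criticism: contrast the two sources against the sum of their dispersions.
    z = (theta_hat - theta_0) / math.sqrt(var_hat + var_0)
    tail = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return post_mean, post_var, z, tail

post_mean, post_var, z, tail = combine_and_contrast(2.0, 1.0, 0.0, 1.0)
print(post_mean, post_var, round(z, 3), round(tail, 4))
```

The same two inputs are used twice: the information (inverse variances) is added to estimate, while the variances themselves are added to criticize.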

The combination of information from the prior and
likelihood into the posterior distribution and the
contrasting of these two sources of information in the
predictive distribution is equally clear in the binomial
example and especially in its appropriate normal
approximation.

5.  SOME  OBJECTIONS  CONSIDERED 

This section recapitulates the argument and considers
some objections to it.

5.1.  Essential  elements  of  the  argument 

A. Scientific investigation is an iterative process in
which the model is not static but is continually evolving.
At a given stage the nature of the uncertainties in a model
directs the acquisition of further data, whether by choosing
the design of an experiment or sample survey, or by
motivating a search of a library or data bank. At, say, the
ith stage of an investigation all current structural
assumptions $A_i$, including those about the prior, must be
thought of, not as being true, but rather as being
subjective guesses which at this particular stage of the
investigation are worth entertaining. It is consistent with
this attitude that when data become available checks
need to be applied to assess consonance with $A_i$.

B. The statistical model at the ith stage of the
investigation should be defined as the joint distribution of
y and $\theta$ given the assumptions $A_i$:

$$p(y, \theta \mid A_i) = p(y \mid \theta, A_i)\, p(\theta \mid A_i). \qquad (24)$$

C. Not one but two distinct kinds of inference are
involved within the iterative process: criticism in which
the appropriateness of regarding data $y_d$ as a realization
of a particular model M is questioned; estimation in which
the consequences of the assumption that data $y_d$ are a
realization of a model M are made manifest.

This criticism-estimation dichotomy is characterized
mathematically by the factorization of the model realization
$p(y_d, \theta \mid A_i)$ into the predictive density $p(y_d \mid A_i)$ and the
posterior distribution $p(\theta \mid y_d, A_i)$. The predictive
distribution $p(y \mid A_i)$ provides a reference distribution for
$p(y_d \mid A_i)$. Similarly the predictive distribution
$p\{g(y) \mid A_i\}$ of any checking function g(y) provides a
reference distribution for the corresponding predictive
density $p\{g(y_d) \mid A_i\}$. Unusually small values of this density
suggest that the current model is open to question.

D. If we are satisfied with the adequacy of the
assumptions $A_i$ then the posterior distribution
$p(\theta \mid y_d, A_i)$ allows for complete estimation of $\theta$ and no
other procedures of estimation are relevant. In particular,
therefore,  insofar  as  shrinkage,  ridge  and  robust  estimators 
are  useful,  they  ought  to  be  direct  consequences  of  an 
appropriate  model  and  should  not  need  the  invocation  of 
extraneous  considerations  such  as  minimization  of  mean 
square  error. 
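
The factorization described under C, and the two uses made of its factors, can be sketched for a small discrete case. The example assumes binomial data with a hypothetical three-point prior on the success probability; neither the prior nor the function names come from the original. For discrete outcomes the check below counts outcomes that are no more probable than the observed one, which is one possible convention.

```python
from math import comb

def binomial_pmf(y, n, theta):
    return comb(n, y) * theta**y * (1.0 - theta)**(n - y)

def factorize(n, prior):
    """prior: dict mapping theta -> prior probability (a hypothetical discrete prior)."""
    joint = {(y, t): binomial_pmf(y, n, t) * w
             for y in range(n + 1) for t, w in prior.items()}
    predictive = {y: sum(joint[(y, t)] for t in prior) for y in range(n + 1)}
    posterior = {y: {t: joint[(y, t)] / predictive[y] for t in prior}
                 for y in range(n + 1)}
    return joint, predictive, posterior

def predictive_check(y_d, predictive):
    # Total predictive probability of outcomes no more probable than the observed one.
    return sum(p for p in predictive.values() if p <= predictive[y_d])

n = 10
prior = {0.1: 0.25, 0.2: 0.50, 0.3: 0.25}   # prior belief concentrated near 0.2
joint, predictive, posterior = factorize(n, prior)
alpha_typical = predictive_check(2, predictive)     # criticism: y_d = 2 is unremarkable
alpha_surprising = predictive_check(9, predictive)  # criticism: y_d = 9 discredits the model
print(round(alpha_typical, 3), alpha_surprising)
print(posterior[2])                                 # estimation, given the model is entertained
```

The posterior is always available, even for y = 9; it is the predictive factor that signals that the assumptions themselves are in doubt.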

Objections.  Numbered  to  correspond  with  the  various 
elements  of  the  argument  are  responses  to  some  objections 
that  have  been,  or  might  be,  raised. 

A(i)  Iterative  investigation?  Some  would  protest  that 
their  own  statistical  experience  is  not  with  iterative 
investigation  but  with  a  single  set  of  data  to  be  analyzed, 
or  a  single  design  to  be  laid  out  and  the  results 
elucidated. 

Many of the circumstances in which the statistician has been
involved in a "one-shot" analysis, rather than an iterative
partnership, ought not to have happened. Such involvement
frequently  occurs  when  the  statistician  has  been  drafted  as 
a  last  resort,  all  other  attempts  to  make  sense  of  the  data 
having  failed.  At  this  point  data  gathering  will  usually 
have  been  completed  and  there  is  no  chance  of  influencing 
the  course  of  the  study.  Statisticians  whose  training  has 
not  exposed  them  to  the  overriding  importance  of 
experimental  design  are  most  likely  to  acquiesce  in  this 
situation,  or  even  to  think  of  it  as  normal,  and  thus  to 
encourage  its  continuance. 

The  statistician  who  has  cooperated  in  the  design  of  a 
single  experiment  which  he  analyzes  is  somewhat  better 
off.  However  one-shot  designs  are  often  inappropriate 
also.  Underlying  most  investigations  is  a  budget,  stated  or 
unstated,  of  time  and/or  money  that  can  reasonably  be 
expended.  Sometimes  this  latent  budget  is  not  adequate  to 
the  goal  of  the  investigation,  but,  for  purposes  of 
discussion, let us suppose that it is. Then if a
sequential/iterative approach is possible it would usually
be  quite  inappropriate  to  plan  the  whole  investigation  at 
the  beginning  in  one  large  design.  This  is  because  the 
results  from  a  first  design  will  almost  invariably  supply 
new  and  often  unexpected  information  about  choice  of 
variables,  metrics,  transformations,  regions  of  operability, 
unexpected  side-effects,  and  so  forth,  which  will  vitally 
influence  the  course  of  the  investigation  and  the  nature  of 
the  next  experimental  arrangement.  A  rough  working  rule  is 
that  not  more  than  25%  of  the  time-and-money  budget  should 
be  spent  on  the  first  design.  Because  large  designs  can  in 
a  limited  theoretical  sense  be  more  efficient  it  is  a  common 
mistake  not  to  take  advantage  of  the  iterative  option  when 
it  is  available.  Instances  have  occurred  of  experimenters 
regretting  that  they  were  persuaded  by  an  inexperienced 
statistician  to  perform  a  large  "all  inclusive"  design  where 
an  adaptive  strategy  would  have  been  much  better.  In 
particular, it is likely that many of the runs from such
"all-embracing" designs will turn out to be noninformative
because  their  structure  was  decided  when  least  was  known 
about  the  problem. 

Scientific iteration is strikingly exemplified in
response surface studies (see, for example, Box and Wilson
[1951], Box [1954], Box and Youle [1955]). In particular
methods  such  as  steepest  ascent  and  canonical  analysis  can 
lead  to  exploration  of  new  regions  of  the  experimental 
space,  requiring  elucidation  by  new  designs  which,  in  turn, 
can  lead  to  the  use  of  models  of  higher  levels  of 
sophistication. Although in these examples the necessity
for such an iterative theory is most obvious, it clearly
exists much more generally, for example in investigations
employing sequences of orthodox experimental designs and in
many applications of regression analysis. It has sometimes
been  suggested  that  agricultural  field  trials  are  not 
sequential  but  of  course  this  is  not  so;  only  the  time  frame 
is  longer.  Obviously  what  is  learned  from  one  year's  work 
is  used  to  design  the  next  year’s  experiments. 

However I agree that there are some more convincing
exceptions. One example is a definitive trial intended to
settle a controversy, such as a test of the
effectiveness of Laetrile as a cure for cancer. Also the
iteration can be very slow. For example, in trials on the
weathering of paints, each phase can take from 5 to 10 years.

A(ii)  Subjective  probability?  The  view  of  the  process 
of  scientific  investigation  as  one  of  model  evolution  has 
consequences  concerning  subjective  probabilities.  An 
objection  to  a  subjectivist  position  is  that  in  presenting 
the  final  results  of  our  investigation,  we  need  to  convince 
the  outside  world  that  we  have  really  reached  the  conclusion 
that  we  say  we  have.  It  is  argued  that,  for  this  purpose, 
subjective  probabilities  are  useless.  However  I  believe 
that  the  confirmatory  stage  of  an  iterative  investigation, 
when  it  is  to  be  demonstrated  that  the  final  destination 
reached  is  where  it  is  claimed  to  be,  will  typically  occupy, 
perhaps,  only  the  last  5  per  cent  of  the  experimental 
effort.  The  other  95  per  cent  -  the  wandering  journey  that 
has  finally  led  to  that  destination  -  involves,  as  I  have 
said,  many  heroic  subjective  choices  (what  variables?  what 
levels?  which  scales?  etc.,  etc.)  at  every  stage.  Since 
there  is  no  way  to  avoid  these  subjective  choices  which  are 
a  major  determinant  of  success  why  should  we  fuss  over 
subjective  probability? 

Of  course,  the  last  5  per  cent  of  the  investigation 
occurs  when  most  of  the  problems  have  been  cleared  up  and  we 
know  most  about  the  model.  It  is  this  rather  minor  part  of 
the  process  of  investigation  that  has  been  emphasized  by 
hypothesis  testers  and  decision  theorists.  The  resultant 
magnification  of  the  importance  of  formal  hypothesis  tests 
has  inadvertently  led  to  underestimation  by  scientists  of 
the  area  in  which  statistical  methods  can  be  of  value  and  to 
a  wide  misunderstanding  of  their  purpose.  This  is  often 
evidenced  in  particular  by  the  attitudes  to  statistics  of 
editors  and  referees  of  journals  in  the  social,  medical  and 
biological  sciences. 

B(i) The Statistical Model? The statistical model has
sometimes been thought of as the density function $p(y \mid \theta, A)$
rather than the joint density $p(y, \theta \mid A)$ which reflects the
influence of the prior. However only the latter form
contains all currently entertained beliefs about y and
$\theta$. It seems quite impossible to separate prior belief from
assumptions about model structure. This is evidenced by the
fact that assumptions are frequently interchangeable between
the density $p(y \mid \theta)$ and the prior $p(\theta)$. As an elementary
example, suppose that among the parameters $\theta = (\phi, \beta)$ of a
class of distributions $\beta$ is a shape parameter such that
$p(y \mid \phi, \beta_0)$ is the normal density. Then it may be
convenient, for example in studies of robustness, to define
a normal distribution by writing the more general density
$p(y \mid \phi, \beta)$ with an associated prior for $\beta$ which can be
concentrated at $\beta = \beta_0$. The element specifying normality
which in the usual formulation is contained in the density
$p(y \mid \theta)$ is thus transferred to the prior $p(\theta)$.

B(ii)  Do  we  need  a  prior?  Another  objection  to  the 
proposed  formulation  of  the  model  is  the  standard  protest  of 
non-Bayesians concerning the introduction of any prior
distribution  as  an  unnecessary  and  arbitrary  element. 
However,  recent  history  has  shown  that  it  is  the  omission  in 
sampling  theory,  rather  than  the  inclusion  in  Bayesian 
analysis,  of  an  appropriate  prior  distribution,  that  leads 
to  trouble. 

For  instance  Stein's  result  [1955]  concerning  the 
inadmissibility  of  the  vector  of  sample  averages  as  an 
estimate  of  the  mean  of  a  multivariate  normal  distribution 
is  well  known.  But  consider  its  practical  implication  for, 
say,  an  experiment  resulting  in  a  one-way  analysis  of 
variance.  Such  an  experiment  could  make  sense  when  it  is 
conducted  to  compare,  for  example,  the  levels  of  infestation 
of  k  different  varieties  of  wheat,  or  the  numbers  of  eggs 
laid  by  k  different  breeds  of  chickens  or  the  yields  of 
k  successive  batches  of  chemical;  in  general,  that  is, 
when a priori we expect similarities of one kind or another
between  the  entities  compared.  But  clearly,  if  similarities 
are  in  mind,  they  ought  not  to  be  denied  by  the  form  of  the 
model.  They  are  so  denied  by  the  improper  prior  which 
produces  as  Bayesian  means  the  sample  averages,  which  are  in 
turn  the  orthodox  estimates  from  sampling  theory. 

Now the reason that k wheat varieties, k chicken
breeds  or  k  batch  yields  are  being  jointly  considered  is 
because  they  are,  in  one  sense  or  another,  comparable.  The 
presence  of  a  specific  form  of  prior  distribution  allows  the 
investigator  to  incorporate  in  the  model  precisely  the  kind 
of  similarities  he  wishes  to  entertain.  Thus  in  the 
comparison  of  varieties  of  wheat  or  of  breeds  of  chicken  it 
might  well  be  appropriate  to  consider  the  variety  means  as 
randomly  sampled  from  some  prior  super-population  and,  as  is 
well  known,  this  can  produce  the  standard  shrinkage 
estimators  as  Bayesian  means  (Lindley  [1965],  Box  and  Tiao 
[1968],  Lindley  and  Smith  [1972]).  But  notice  that  such  a 
model  is  likely  to  be  quite  inappropriate  for  the  yields 
of  k  successive  batches  of  chemical.  These  mean  yields 
might  much  more  reasonably  be  regarded  as  a  sequence  from 
some  autocorrelated  time  series.  A  prior  which  reflected 
this  concept  led  Tiao  and  Ali  [1971]  to  functions  for  the 
Bayesian  means  which  are  quite  different  from  the  orthodox 
shrinkage estimators.
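
How a super-population prior produces shrinkage can be sketched as follows. The example assumes that each observed group mean has a known sampling variance, that the group means are drawn from a normal prior with known variance, and that the unknown prior mean is replaced by the grand mean. These are simplifications made for illustration, not the full analyses cited above, and the function name and data are hypothetical.

```python
def shrink_means(group_means, sampling_var, prior_var):
    """Posterior means when group means are exchangeable draws from N(mu, prior_var).

    sampling_var is the variance of each observed group mean; mu is replaced by
    the grand mean, which is a plug-in simplification made for this sketch.
    """
    grand_mean = sum(group_means) / len(group_means)
    weight = prior_var / (prior_var + sampling_var)   # weight given to the data
    return [grand_mean + weight * (m - grand_mean) for m in group_means]

means = [10.0, 12.0, 14.0, 20.0]
shrunk = shrink_means(means, sampling_var=4.0, prior_var=4.0)
print(shrunk)
```

The form of the shrinkage follows from the exchangeable prior. A prior reflecting serial dependence between successive batches would lead to a different function of the data, which is the point made in the text.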

In  summary,  then,  both  sampling  theory  and  Bayes  theory 
can  rationalize  the  use  of  shrinkage  estimators,  and  the 
fact  that  the  former  does  so  merely  on  the  basis  of 
reduction  of  mean  square  error  with  no  overt  use  of  a  prior 
distribution,  at  first  seems  an  advantage.  However,  only 
the  explicit  inclusion  of  a  prior  distribution,  which 
sensibly  describes  the  situation  we  wish  to  entertain,  can 
tell  us  what  is  the  appropriate  function  to  consider,  and 
avoid  the  manifest  absurdities  which  seem  inherent  in  the 
sampling  theory  approach  which  implies,  for  example,  that  we 
can  improve  estimates  by  considering  as  one  group  varieties 
of  wheat,  breeds  of  chicken,  and  batches  of  chemical. 

C(i) Is there an iterative interplay between criticism
and estimation? A good example of the iterative
interplay between criticism and estimation is seen in
parametric time series model building as described for
example by Box and Jenkins [1970]. Critical inspection of
the plotted time series and of the corresponding plotted
autocorrelation function, and other functions derivable from
it, together with their rough limits of error, can suggest a
model specification and in particular a parametric model.
Temporarily  behaving  as  if  we  believed  this  specification, 
we  may  now  estimate  the  parameters  of  the  time  series  model 
by  their  Bayesian  posterior  distribution  (which,  for  samples 
of  the  size  usually  employed,  is  sufficiently  well  indicated 
by  the  likelihood).  The  residuals  from  the  fitted  model  are 
now  similarly  critically  examined,  which  can  lead  to 
respecification  of  the  model,  and  so  on.  Systematic 
liquidation  of  serial  dependence  brought  about  by  such  an 
iteration can eventually produce a parametric time series
model, that is, a linear filter which approximately
transforms the time series to a white noise series. Anyone
who  carries  through  this  process  must  be  aware  of  the  very 
different  nature  of  the  two  inferential  processes  of 
criticism  and  estimation  which  are  used  in  alternation  in 
each  iterative  cycle. 
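
One cycle of this alternation can be sketched on simulated data. The example assumes a first-order autoregressive series generated with a hypothetical coefficient of 0.7, uses least squares as a stand-in for the posterior mode, and takes two standard errors as the rough limit of error; all names and settings are illustrative.

```python
import random

def autocorr(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom

def fit_ar1(x):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + a[t] (mean assumed zero)."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den

rng = random.Random(0)
n, true_phi = 500, 0.7
series = [0.0]
for _ in range(n - 1):
    series.append(true_phi * series[-1] + rng.gauss(0.0, 1.0))

limit = 2.0 / n ** 0.5                   # rough two-standard-error limit
r1_series = autocorr(series, 1)          # criticism: is the series white noise?
phi_hat = fit_ar1(series)                # estimation: behave as if AR(1) were true
residuals = [series[t] - phi_hat * series[t - 1] for t in range(1, n)]
r1_resid = autocorr(residuals, 1)        # criticism again, now of the fitted model
print(round(r1_series, 3), round(phi_hat, 3), round(r1_resid, 3), round(limit, 3))
```

The first autocorrelation far exceeds its rough limit and prompts the respecification; the residual autocorrelation gives no such signal, so the cycle can stop or move on to other checks.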

C(ii)  Why  can't  all  criticism  be  done  using  Bayes 
posterior  analysis? 

It is sometimes argued that model checking can always
be performed as follows: let $A_1, A_2, \ldots, A_k$ be alternative
assumptions; then the computation of

$$p(A_i \mid y) = \frac{p(y \mid A_i)\, p(A_i)}{\sum_{j=1}^{k} p(y \mid A_j)\, p(A_j)} \qquad (i = 1, 2, \ldots, k) \qquad (25)$$

yields the probabilities for the various sets of
assumptions.

The  difficulty  with  this  approach  is  that  by  supposing 
all  possible  sets  of  assumptions  known  a  priori  it 
discredits  the  possibility  of  new  discovery.  But  new 
discovery  is,  after  all,  the  most  important  object  of  the 
scientific  process. 

At  first,  it  might  be  thought  that  the  use  of  (25)  is 
not  misleading,  since  it  correctly  assesses  the  relative 
plausibility  of  the  models  considered.  But  in  practice  this 
would  seem  of  little  comfort.  For  example  suppose  that 
only  k  =  3  models  are  currently  regarded  as  possible,  and 
that  having  collected  some  data  the  posterior  probabilities 
$p(A_i \mid y)$ are 0.001, 0.001, 0.998 (i = 1,2,3). Although
in relation to these particular alternatives $p(A_3 \mid y)$ is
overwhelmingly large this does not necessarily imply that in
the real world assumptions $A_3$ could be safely adopted.

For suppose that, unknown to the investigator, a fourth
possibility $A_4$ exists which, given the data, is a thousand
times more probable than the group of assumptions previously
considered.  Then,  if  that  model  had  been  included,  the 
probabilities  would  be  0.000,001,  0.000,001,  0.000,998,  and 
0.999,000. 
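
The renormalization behind these figures is a one-line application of (25). In the sketch below the function name is hypothetical, and the fourth weight is set to exactly 1000 times the total for the first three assumptions, which is one reading of "a thousand times more probable".

```python
def posterior_model_probabilities(weights):
    """Normalize nonnegative weights p(y|A_i) p(A_i), as in (25)."""
    total = sum(weights)
    return [w / total for w in weights]

three = [0.001, 0.001, 0.998]            # already normalized among A1, A2, A3
# Hypothetical A4 whose weight is 1000 times that of the whole original group.
four = posterior_model_probabilities(three + [1000.0 * sum(three)])
print([round(p, 6) for p in four])
```

Under this reading the results agree with the rounded figures quoted in the text to within about one unit in the sixth decimal place. Nothing in the calculation based on the first three assumptions alone could have revealed the existence of the fourth.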

Furthermore, in ignorance of $A_4$ it is highly likely
that a study of the components of the predictive
distribution $p(y \mid A_3)$ and in particular of the residuals,
could (a) have shown that $A_3$ was not acceptable and (b)
have provided clues as to the identity of $A_4$. The
objective of good science must be to conjure into existence
what has not been contemplated previously. A Bayesian
theory which excludes this possibility subverts the
principal aim of scientific investigation.

More generally, the possibility that there is more
than one set of assumptions that may be considered merely
extends the definition of the model to

$$p(y, \theta, A_j) = p(y \mid \theta, A_j)\, p(\theta \mid A_j)\, p(A_j) \qquad (j = 1, 2, \ldots, k)$$

which  in  turn  will  yield  a  predictive  distribution.  In  a 
situation  when  this  more  general  model  is  inadequate  a 
mechanical  use  of  Bayes  theorem  could  produce  a  misleading 
analysis,  while  suitable  inspection  of  predictive  checks 
could  have  demonstrated,  on  a  sampling  theory  argument,  that 
the  global  model  was  almost  certainly  wrong  and  could  have 
indicated  possible  remedies. 

C(iii)  An  abrogation  of  the  likelihood  principle?  The 
likelihood  principle  holds,  of  course,  for  the  estimation 
aspect  of  inference  in  which  the  model  is  temporarily 
assumed true. However, it is inapplicable to the criticism
process  in  which  the  model  is  regarded  as  in  doubt. 


¹ I am grateful to Dr. Michael Titterington for pointing out that in
discriminant analysis the atypicality indices of Aitchison and Aitken
[1976] use similar ideas.

If the assumptions A are supposed true, the
likelihood function contains all the information about $\theta$
coming from the particular observed data vector $y_d$. When
combined with the prior distribution for $\theta$ it therefore
tells all we can know about $\theta$ given $y_d$ and A. In such
a case the predictive density $p(y_d \mid A)$ can tell us nothing
we have not already assumed to be true, and will fall within
a given interval with precisely the frequency forecast by
the predictive distribution. When the assumptions are
regarded as possibly false, however, this will no longer be
true and information about model inadequacy can be supplied
by considering the density $p(y_d \mid A)$ in relation to
$p(y \mid A)$. Thus for the Normal linear model, the distribution
of  residuals  contains  no  information  if  the  model  is  true, 
but  provides  the  reference  against  which  standard  residual 
checks,  graphical  and  otherwise,  are  made  on  the  supposition 
that  it  may  be  importantly  false. 

In  the  criticism  phase  we  are  considering  whether, 
given A, the sample $y_d$ is likely to have occurred at
all.  To  do  this  we  must  consider  it  in  relation  to  the 
other  samples  that  could  have  occurred  but  did  not. 

For  instance  in  the  Bernoulli  trial  example,  had  we 
sampled  until  we  had  r  successes  rather  than  until  we 
had  n  trials,  then  the  likelihood,  and,  for  a  fixed  prior, 
the  posterior  distribution,  would  have  been  unaffected,  but 
the  predictive  check  would  (appropriately)  have  been 
somewhat  different  because  the  appropriate  reference  set 
supplied  by  p(y|A)  would  be  different. 
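
The point can be sketched numerically. The example assumes a Beta prior on the success probability, here the uniform Beta(1, 1), and hypothetical data of 3 successes in 12 trials; the function names are illustrative. The posterior is the same Beta distribution under both sampling rules, while the predictive reference sets, and hence the tail areas, differ.

```python
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_choose(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def predictive_fixed_n(y, n, a, b):
    """Beta-binomial: probability of y successes when n trials are fixed."""
    return exp(log_choose(n, y) + log_beta(a + y, b + n - y) - log_beta(a, b))

def predictive_fixed_r(m, r, a, b):
    """Probability that the r-th success arrives on trial m (inverse sampling)."""
    return exp(log_choose(m - 1, r - 1) + log_beta(a + r, b + m - r) - log_beta(a, b))

def tail_area(p_observed, probabilities):
    # Probability of outcomes no more probable than the observed one, computed as a
    # complement; this is valid for a truncated support provided every omitted
    # outcome is less probable than the observed one.
    tol = 1e-9
    more_probable = sum(p for p in probabilities if p > p_observed * (1.0 + tol))
    return 1.0 - more_probable

r, n, a, b = 3, 12, 1.0, 1.0          # hypothetical data and a uniform Beta(1, 1) prior
posterior = (a + r, b + n - r)        # Beta parameters, identical under both schemes

p_n = [predictive_fixed_n(y, n, a, b) for y in range(n + 1)]
alpha_fixed_n = tail_area(p_n[r], p_n)

p_r = [predictive_fixed_r(m, r, a, b) for m in range(r, 5000)]
alpha_fixed_r = tail_area(p_r[n - r], p_r)
print(posterior, round(alpha_fixed_n, 4), round(alpha_fixed_r, 4))
```

With a uniform prior every count from 0 to n is equally probable in advance when n is fixed, so no outcome is surprising; when sampling continues until the third success, needing as many as 12 trials is a less probable event.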

C(iv)  How  do  you  choose  the  significance  level? 

It  has  been  argued  that  if  significance  tests  are  to  be 
employed  to  check  the  model,  then  it  is  necessary  to  state 
in advance the level of significance $\alpha$ which is to be used
and  that  no  rational  basis  exists  for  making  such  a  choice. 

While  I  believe  the  ultimate  justification  of  model 
checking  is  the  reference  of  the  checking  function  to  its 
appropriate  predictive  distribution,  the  examples  I  have 
given  to  illustrate  the  predictive  check  may  have  given  a 
misleading  idea  of  the  formality  with  which  this  should  be 
done. In practice the predictive check is not intended as a
formal test in the Neyman-Pearson sense but rather as a
rough  assessment  of  signal  to  noise  ratio.  It  is  needed  to 
see  which  indications  might  be  worth  pursuing.  In  practice 
model  checks  are  frequently  graphical,  appealing  as  they 
should  to  the  pattern  recognition  capability  of  the  right 
brain.  Examples  are  to  be  found  in  the  Normal  probability 
plots  for  factorial  effects  and  residuals  advocated  by 
Daniel  [1959],  Atkinson  [1973]  and  Cook  [1977].  Because 
spurious  patterns  may  often  be  seen  in  noisy  data  some  rough 
reference  of  the  pattern  to  its  noise  level  is  needed. 

D.  As  might  be  expected  the  mistaken  search  for  a 
single  principle  of  inference  has  resulted  in  two  kinds  of 
incongruity: 

attempts  to  base  estimation  on  sampling  theory,  using 
point  estimates  and  confidence  intervals;  and 
attempts  to  base  criticism  and  hypothesis  testing 
entirely  on  Bayesian  theory. 

The present proposals exclude both these possibilities.

Concerning  estimation,  we  will  not  here  recapitulate 
the  usual  objections  to  confidence  intervals  and  point 
estimates  but  will  consider  the  latter  in  relation  to 
shrinkage  estimators,  ridge  estimators,  and  robust 
estimators.  From  the  traditional  sampling  theory  point  of 
view  these  estimators  have  been  justified  on  the  ground  that 
they have smaller mean square error than traditional
estimators.  But  from  a  Bayesian  viewpoint,  they  come  about 
as  a  direct  result  of  employing  a  credible  rather  than  an 
incredible  model.  The  Bayes’  approach  provides  some 
assurance  against  incredibility  since  it  requires  that  all 
assumptions  of  the  model  be  clearly  visible  and  available 
for  criticism. 

For  illustration,  emphasized  below  by  underlining,  are 
the  assumptions  that  would  be  needed  for  a  Bayesian 
justification  of  standard  linear  least  squares.  We  must 
postulate  not  only  the  model 

$$y_u = \theta' x_u + e_u, \qquad u = 1, 2, \ldots, n \qquad (26)$$

with the $e_u$'s independently and normally distributed with
constant variance $\sigma^2$, but also postulate an improper prior
for $\theta$ and $\sigma^2$.

(a) Consider first the choice of prior. As was
pointed out by Anscombe [1963], if we use a measure such as
$\theta'\theta$ to gauge the size of the parameters, a locally flat
prior for $\theta$ implies that the larger is the size measure
$\theta'\theta$ the more probable it becomes. The model is thus
incredible. From a Bayesian viewpoint shrinkage and ridge
estimators imply more credible choices of the model, which,
even though approximate, are not incredible.

(b)  For  data  collected  serially  (in  particular,  for 
much  economic  data)  the  assumption  of  error  independence  in 
equation  (26)  is  equally  incredible  and  again  its  violation 
can  lead  to  erroneous  conclusions.  See  for  example  Coen, 
Gomme  and  Kendall  [1969]  and  Box  and  Newbold  [1971]. 

(c) The assumption that the specification in (26) is
necessarily appropriate for every subscript u = 1,2,...,n
is surely incredible. For it implies that the
experimenter's answer to the question "Could there be a small
probability (such as 0.001) that any one of the experimental
runs was unwittingly misconducted?" is "No; that probability
is exactly zero."
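
Anscombe's point in (a) can be made explicit with a short calculation. The sketch assumes a prior density that is constant over a k-dimensional region of the parameter space and writes r for the square root of the size measure; the symbols r, c and $S_{k-1}$ are introduced here for illustration and do not appear in the original.

```latex
% Flat prior density c on a region of R^k; S_{k-1} is the surface area of the unit sphere.
p(\theta) = c
\quad\Longrightarrow\quad
p(r) = c \, S_{k-1} \, r^{k-1}, \qquad r = (\theta'\theta)^{1/2}.
% For k >= 2 the implied density of the size measure increases with r:
% a thin shell at radius 2r carries 2^{k-1} times the prior mass of an equally thin shell at radius r.
```

For a model with many coefficients the exponent k - 1 is large, so a locally flat prior places most of its weight on parameter vectors of large size.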

So far as the last assumption is concerned a more
credible model considered by Jeffreys [1932], Dixon
[1953], Tukey [1960] and Box and Tiao [1968] supposes that
the error e is distributed as a mixture of Normal
distributions

$$p(e \mid 0, \sigma) = (1 - \alpha)\, f(e \mid 0, \sigma) + \alpha\, f(e \mid 0, k\sigma). \qquad (27)$$

This model was used by Bailey and Box [1980] to estimate the
15 coefficients in the fitted model

$$y = \theta_0 + \sum_{i=1}^{4} \theta_i x_i + \sum_{i=1}^{3} \sum_{j>i} \theta_{ij} x_i x_j + \sum_{i=1}^{4} \theta_{ii} x_i^2 + e \qquad (28)$$

using data from a balanced incomplete $3^4$ factorial design.
Table  1  shows  some  of  their  Bayes'  estimates  (marginal  means 
and  standard  deviations  of  the  posterior  distribution).  For 
simplicity,  only  a  few  of  the  coefficients  are  shown;  the 
behaviour of the others is similar. Table 1a uses data from
a paper by Box and Behnken [1960]. These data (see Figure 3)
apparently contain a single bad value and a small
possibility of a second bad value. Table 1b shows
the same analysis for a second set of data arising from the
same design and published by Bacon [1970], which (see Figure
4) appears to contain no bad values. It was shown by Chen
and Box [1979] that for k > 5 the posterior distribution
of $\theta$ is mainly a function of the single parameter
$\epsilon = \alpha/\{(1-\alpha)k\}$ and the results obtained for k = 5 are
labelled in terms of $\epsilon$ as well as $\alpha$. The analysis is
based on locally noninformative priors on $\theta$ and on $\log\sigma$
so that the estimates in the first columns of the tables
($\epsilon = \alpha = 0$) are ordinary least squares estimates. The
important point to notice is that for the first set of data
which appears to contain one or two bad values, a major
change away from the least squares estimates can occur as
soon as there is even a slight hint ($\epsilon = 0.001$, $\alpha = 0.005$)
of the possibility of contamination. The estimates then
remain remarkably stable for widely different values of $\epsilon$
over a plausible range. But for the second set (Bacon's
data), which appears to contain no bad values, scarcely any
change occurs at all as $\epsilon$ is changed.
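
The way in which the mixture (27) discounts a discrepant observation can be sketched for a single residual. The example assumes the residual and the scale are known, which the full analysis does not; it integrates over the coefficients and the scale. The function names and the residual values are hypothetical.

```python
import math

def normal_pdf(e, sigma):
    return math.exp(-0.5 * (e / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def prob_bad(e, sigma, alpha, k):
    """Probability that residual e came from the wide component of mixture (27)."""
    good = (1.0 - alpha) * normal_pdf(e, sigma)
    bad = alpha * normal_pdf(e, k * sigma)
    return bad / (good + bad)

def epsilon(alpha, k):
    """The single parameter alpha / ((1 - alpha) k) used to label the results."""
    return alpha / ((1.0 - alpha) * k)

alpha, k, sigma = 0.005, 5.0, 1.0
eps = epsilon(alpha, k)
print(round(eps, 6))
for e in (0.0, 2.0, 4.0, 6.0):
    print(e, round(prob_bad(e, sigma, alpha, k), 4))
```

With these settings the label works out to about 0.001, matching the pairing of values quoted in the text. Residuals within about two standard deviations are left almost untouched, while one at six standard deviations is attributed to the wide component with near certainty.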

It  has  been  objected  that  while  the  Normal  model  is 
inadequate,  the  contaminated  model  (27)  may  be  equally  so, 
and  that  "therefore"  we  are  better  off  using  ad  hoc  robust 
procedures  such  as  have  been  recommended  by  Tukey  and  others 
and  justified  on  the  basis  of  their  sampling  properties. 

This argument loses force, however, since it can be shown by
elementary examples (Chen and Box [1979], Box [1980]) that
the effect of the Bayes' analysis is also to produce
downweighting of the observations with downweighting
functions very similar to those proposed by the
empiricists. However, the Bayes' analysis has the advantage
of being based on a visible model which is itself open to
criticism and has greater adaptivity, doing nothing to
samples that look normal, and reserving robustification for
samples that do not. A further advantage of the present
point of view is that when an outlier occurs, while the
posterior distribution will discount it, the predictive
distribution will emphasize it, so that the fact that a
discrepancy has occurred is not lost sight of.

(Footnote) They are however (see reply to the discussion of Box [1980])
considerably different from estimates obtained by omitting the suspect
observation and using ordinary least squares.

Figure 3. Posterior probability that $y_u$ is
bad given that one observation is
bad (Box-Behnken data).

Figure 4. Posterior probability that $y_u$ is
bad given that one observation is
bad (Bacon data).

CONCLUSION. 

In  summary  I  believe  that  scientific  method  employs  and 
requires  not  one,  but  two  kinds  of  inference  -  criticism  and 
estimation;  once  this  is  understood  the  statistical  advances 
made  in  recent  years  in  Bayesian  methods,  data  analysis, 
robust  and  shrinkage  estimators  can  be  seen  as  a  cohesive 
whole. 